ABSTRACT

The present study was an attempt to alleviate some of
the difficulties inherent in multiple-choice items by having
examinees respond to multiple-choice items in a probabilistic manner.
Using this format, examinees are able to respond to each alternative
and to provide indications of any partial knowledge they may possess
concerning the item. The items used in this study were 30
multiple-choice analogy items. Examinees were asked to distribute 100
points among the four alternatives for each item according to how
confident they were that each alternative was the correct answer.
Each item was scored using five different scoring formulas. Three of
these scoring formulas were reproducing scoring systems. Results from
this study showed a small effect of certainty on the probabilistic
scores in terms of the validity of the scores but no effect at all on
the factor structure or internal consistency of the scores. Once the
effect of certainty on the probabilistic scores had been ruled out,
the five scoring formulas were compared in terms of validity,
reliability, and factor structure. There were no differences in the
validity of the scores from the different methods. (Author/BW)
The present study was an attempt to alleviate some of the difficulties
inherent in multiple-choice items by having examinees respond to multiple-
choice items in a probabilistic manner. Using this format, examinees are able
to respond to each alternative and to provide indications of any partial
knowledge they may possess concerning the item. The items used in this study
were 30 multiple-choice analogy items. Examinees were asked to distribute 100
points among the four alternatives for each item according to how confident
they were that each alternative was the correct answer. Each item was scored
using five different scoring formulas. Three of these scoring formulas (the
spherical, quadratic, and truncated log scoring methods) were reproducing
scoring systems. The fourth scoring method used the probability assigned to
the correct alternative as the item score, and the fifth used a function of
the absolute difference between the correct response vector for the four
alternatives and the actual points assigned to each alternative as the item
score. Total test scores for all of the scoring methods were obtained by
summing individual item scores.

Several studies using probabilistic response methods have shown the effect of
a response-style variable, called certainty or risk taking, on scores obtained
from probabilistic responses. Results from this study showed a small effect
of certainty on the probabilistic scores in terms of the validity of the
scores but no effect at all on the factor structure or internal consistency of
the scores. Once the effect of certainty on the probabilistic scores had been
ruled out, the five scoring formulas were compared in terms of validity,
reliability, and factor structure. There were no differences in the validity
of the scores from the different methods, but scores obtained from the two
scoring formulas that were not reproducing scoring systems were more reliable
and had stronger first factors than the scores obtained using the reproducing
scoring systems. For practical use, however, the reproducing scoring systems
may have an advantage because they maximize examinees' scores when examinees
respond honestly, while honest responses will not necessarily maximize an
examinee's score with the other two methods. If a reproducing scoring system
is used for this reason, the spherical scoring formula is recommended, since
it was the most internally consistent and showed the strongest first factor of
the reproducing scoring systems.






Contents

Introduction
Item-Weighting Formulas
Variations of the Response Format of Multiple-Choice Items
Use of Subjective Probabilities with Multiple-Choice Items
Extraneous Influences on the Use of Subjective Probabilities with
Multiple-Choice Items
Use of Alternate Item Types
Purpose
Method
Test Items
Test Administration
Item Scoring
Determining the Effect of Certainty
Evaluative Criteria
Results
Score Intercorrelations
Validity and Reliability
Factor Analysis of Probabilistic Scores
Discussion and Conclusions
The Influence of Certainty
Choice Among Scoring Methods
Conclusions
References
Appendix: Supplementary Tables

Technical Editor: Barbara Leslie Camm 






Effect of Examinee Certainty on Probabilistic Test Scores
and a Comparison of Scoring Methods for Probabilistic Responses



Psychometricians have searched for many years for a test item format that
would allow them to measure individual differences on a variable of interest as
accurately and as completely as possible. The multiple-choice item has proven
to be a useful tool for assessing knowledge, but there are several problems with
this item format. These problems include the possibility of an examinee
guessing the correct answer, the lack of information concerning the process used
by an examinee to obtain a given answer, and, in general, an inability to
accurately determine an examinee's level on a continuous underlying trait based
on an observable dichotomous response.

In attempts to remedy these problems and to extract the maximum amount of
information from an individual's responses to a set of test items, Lord and
Novick (1968, Chap. 14) have identified three important components of interest.
These components are

1. The measurement procedure, or the manner in which examinees are
instructed to respond to the items.

2. The item scoring formula.

3. The method of weighting each item to form a total score.

In their attempts to find alternatives to the conventional multiple-choice item
where the examinee is instructed to choose the one best answer to an item from a
number of alternatives, investigators have generally focused on one or two of
these components at a time.

The various attempts to improve upon the traditional multiple-choice item
can be classified into three broad categories: (1) attempts to improve the
multiple-choice item by using an item-weighting formula other than the
conventional unit-weighting scheme, (2) variations of the multiple-choice item
that attempt to provide more information about an examinee's ability level by
asking the examinee to respond to a traditional multiple-choice item in a manner
other than simply choosing the one best alternative, and (3) the use of item
types which are completely different from the conventional multiple-choice item,
such as free-response items. The first category focuses on the third component
enumerated by Lord and Novick, the item-weighting formula. The second category
focuses on Lord and Novick's first two components (the measurement procedure and
item-scoring formulas) while continuing to use a unit-weighting scheme to
combine item scores into a total score. The third category focuses primarily on
the measurement procedure and, to a lesser extent, on item scoring formulas.

Item-Weighting Formulas 

For many years the accepted method of combining item scores to form a test
score was simply to sum all of the individual item scores. Since this procedure
is equivalent to multiplying each item score by an item weight of 1 and then
summing the weighted item scores, the method has been called unit weighting. In
attempts to increase the validity and/or the reliability of test scores obtained
by summing item scores, many researchers have abandoned unit weighting in favor
of various forms of differential weighting of individual items. These methods
of differential weighting of items include multiple regression techniques
(Wesman & Bennett, 1959), using the validity coefficient of the item as the item
weight (Guilford, 1941), weighting items by the reciprocal of the item standard
deviation (Terwilliger & Anderson, 1969), a priori item weights (Burt, 1950),
and numerous other weighting procedures (Bentler, 1968; Dunnette & Hoggatt, 1957;
Hendrickson, 1970; Horst, 1936; Wilks, 1938).
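
As a minimal illustration of the distinction drawn above, the following sketch
computes a unit-weighted total and a total weighted by the reciprocal of each
item's standard deviation. The function names and the toy data are hypothetical,
and the use of the population standard deviation is an assumption; the cited
authors may have used a different estimator.

```python
from statistics import pstdev


def weighted_total(item_scores, weights):
    """Sum item scores after multiplying each by its item weight."""
    if len(item_scores) != len(weights):
        raise ValueError("need exactly one weight per item")
    return sum(w * s for w, s in zip(weights, item_scores))


def unit_weighted_total(item_scores):
    """Unit weighting: every item weight is 1, so this is a plain sum."""
    return weighted_total(item_scores, [1] * len(item_scores))


def reciprocal_sd_weights(score_matrix):
    """One weight per item: 1 / (standard deviation of that item's scores).

    score_matrix is a list of examinee rows, each a list of item scores.
    Assumption: every item has nonzero variance across examinees.
    """
    columns = list(zip(*score_matrix))
    return [1.0 / pstdev(column) for column in columns]


# Hypothetical data: 4 examinees by 3 dichotomously scored items.
scores = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 1, 1]]
weights = reciprocal_sd_weights(scores)
totals_unit = [unit_weighted_total(row) for row in scores]
totals_weighted = [weighted_total(row, weights) for row in scores]
```

With only three highly similar items the two sets of totals order the examinees
almost identically, which is in the spirit of the conclusion reported by Wang
and Stanley (1970) in the next paragraph.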

In reviewing the substantial literature in this area, Wang and Stanley
(1970, p. 664) have concluded that "although differential weighting
theoretically promises to provide substantial gains in predictive or construct
validity, in practice these gains are often so slight that they do not seem to
justify the labor involved in deriving the weights and scoring with them. This
is especially true when the component measures are test items." Gulliksen (1950)
concluded, in concurrence with Wang and Stanley (1970), that differential
weighting is not worthwhile when a test contains more than approximately 10
items and when the items are highly correlated. Stanley and Wang (1970), after
concluding that differential item weighting is not a fruitful venture for test
items, have suggested that the item score be determined by the response made to
an item, where the examinee is required to do more than just select the correct
alternative for an item. By changing the mode of response and devising item
scoring formulas appropriate for these types of responses, the validity and/or
reliability of test scores might be increased. An additional gain might be more
insight into the process involved in responding to test items.

Variations of the Response Format of Multiple-Choice Items

Several of the earliest attempts at modification of the method of responding
to a conventional multiple-choice item were reported by Dressel and Schmid
(1953) in an investigation of various item types and scoring formulas. A
conventional multiple-choice test and one of four "experimental test forms" were
administered to each subject. The items in each of the experimental test forms
resembled conventional multiple-choice items in that an item stem and several
alternatives were provided, but each experimental test form differed from the
conventional multiple-choice format in the following ways:

1. Free-choice format. Examinees were instructed to choose as many of the
alternatives provided as necessary to ensure that they had chosen the
correct alternative. This item format was scored using Equation 1,
which yields integer scores that range from -4 to 4 and applies only to
five-alternative items:

\[ \text{Item score} = 4C - I \tag{1} \]

where C = number of correctly marked alternatives and
I = number of incorrectly marked alternatives.

2. Degree-of-certainty test. Examinees were instructed to choose the one
"best" answer for an item and then to choose one of four confidence
ratings provided to indicate the degree of confidence they had in the
answer they had chosen. This item format was scored as shown in Table 1.

3. Multiple-answer format. Each item contained more than one correct
alternative, and the examinees were instructed to choose all of the
correct alternatives. The score for this format was the number of correct
alternatives chosen minus a correction factor for any incorrect
alternatives chosen.

Table 1
Scoring System for Degree-of-Certainty Test

                                          Item Score
                                   Correct       Incorrect
                                   Answer        Answer
Confidence Rating                  Chosen        Chosen

Positive                              4             -4
Fairly certain                        3             -3
Rational guess                        2             -2
No defensible basis for choice        1             -1


4. Two-answer format. Each item contained exactly two correct alternatives,
and the examinees were instructed to indicate both of the correct
alternatives. The item score was simply the number of correct
alternatives chosen.
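
The scoring rules for three of these formats can be sketched as follows. This
is an illustrative reading of Equation 1, Table 1, and the two-answer rule, not
the authors' own procedure; the function names and the use of letter labels for
alternatives are hypothetical. The multiple-answer format is omitted because
its correction factor is not specified here.

```python
def free_choice_score(marked, correct):
    """Equation 1 for a five-alternative item: 4C - I."""
    marked = set(marked)
    c = 1 if correct in marked else 0   # correctly marked alternatives
    i = len(marked - {correct})         # incorrectly marked alternatives
    return 4 * c - i


# Table 1: points by confidence rating; the sign depends on correctness.
CONFIDENCE_POINTS = {
    "positive": 4,
    "fairly certain": 3,
    "rational guess": 2,
    "no defensible basis for choice": 1,
}


def degree_of_certainty_score(chosen, correct, rating):
    """Positive points for a correct choice, negative for an incorrect one."""
    points = CONFIDENCE_POINTS[rating]
    return points if chosen == correct else -points


def two_answer_score(marked, correct_pair):
    """Number of the two correct alternatives that the examinee marked."""
    return len(set(marked) & set(correct_pair))
```

For example, marking only the correct alternative in the free-choice format
earns 4 points, while marking all four incorrect alternatives and missing the
correct one earns -4, which matches the stated score range.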

In comparing these five test forms (the conventional multiple-choice format
and the four experimental test formats), Dressel and Schmid's (1953) results
showed that the experimental test formats containing more than one correct
alternative (Formats 3 and 4 above) exhibited greater internal-consistency
reliability than the other three test forms, but these test formats also took
longer to administer than all of the other formats. All of the experimental test
formats had higher internal-consistency reliability than the conventional
multiple-choice test except for the free-choice format, but the conventional
multiple-choice format took less time than any of the experimental test formats.
Although the higher reliability coefficients of several of these formats
(Formats 2, 3, and 4) might suggest that these formats aid in introducing more
ability variance than error variance, the authors warn that the results must be
viewed with caution, since there were statistically significant differences
between the groups taking each experimental form on the standard multiple-choice
test that was administered to all of their subjects; thus, the differences
attributed to the effect of test format might be due to systematic ability
differences in the groups taking each of the experimental test formats.

Hopkins, Hakstian, and Hopkins (1973) used a confidence weighting procedure
similar to the degree-of-certainty test used by Dressel and Schmid (1953) and
reported higher split-half reliability coefficients for the confidence weighting
format than for a conventional multiple-choice test using the same items.
Hopkins et al. (1973) also reported validity coefficients that were correlations
between the test scores and a short-answer form of the same test. The validity
coefficient for the conventional test (.70) was higher but not significantly
different from that of the confidence weighting format (.67).

Coombs (1953) felt that examinees could provide more information about the
degree of knowledge they possessed by eliminating the alternatives which they
felt were incorrect, rather than by choosing the one correct alternative. Items
using this format were scored by assigning one point for each incorrect
alternative eliminated and 1 - K points when the correct alternative was
eliminated, where K is the number of alternatives provided. This scoring system
yields a range of integer item scores from -3 to 3 for a four-alternative
multiple-choice item.
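
The elimination scoring rule can be checked with a short sketch. The function
name and the letter labels are hypothetical; the enumeration simply confirms
the stated range of -3 to 3 for a four-alternative item.

```python
from itertools import combinations


def elimination_score(eliminated, correct, k):
    """+1 per incorrect alternative eliminated; 1 - k if the correct one is."""
    eliminated = set(eliminated)
    score = len(eliminated - {correct})
    if correct in eliminated:
        score += 1 - k
    return score


# Enumerate every possible set of eliminated alternatives for one item.
alternatives = ["A", "B", "C", "D"]
all_scores = {
    elimination_score(subset, "A", len(alternatives))
    for r in range(len(alternatives) + 1)
    for subset in combinations(alternatives, r)
}
```

Note that eliminating every alternative, correct one included, nets a score of
0 under this rule, the same as eliminating nothing.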

In comparing this test format with a conventional multiple-choice test,
Coombs, Milholland, and Womer (1956) found no differences in validity between
the two formats for separate tests of vocabulary, spatial visualization, and
driver information. The validity coefficients used were correlations between
test scores and criteria such as Stanford-Binet IQ, another test of spatial
ability, and subtest scores from the Differential Aptitude Test. For these same
content areas, the experimental test format yielded higher reliability estimates
than the conventional test, but the differences between the estimates were not
statistically significant for any of the content areas. One result in favor of
the experimental test format was that the subjects in the experiment felt the
experimental format to be fairer than the conventional format.

Another variation upon the conventional multiple-choice item includes a
self-scoring method advocated by Gilman and Ferry (1972), which requires
examinees to choose among alternatives provided until the correct alternative is
chosen. Feedback is given after each choice is made. The item score is simply
the number of responses needed to choose the correct alternative; thus, a higher
score indicates less knowledge about an item. Kane and Moloney (1974) have
warned that although Gilman and Ferry (1972) found an increase in split-half
reliability using this technique, the effect of using this method on the
reliability of the test depends upon the ability of the distractors to
discriminate between examinees of varying levels of ability. An increase in
reliability will result when the distractors possess this ability to
discriminate among ability levels, but no increase in reliability will occur if
this is not the case.

Use of Subjective Probabilities with Multiple-Choice Items

A modification of the traditional multiple-choice item that has generated
much research and interest is the use of examinees' subjective probabilities
concerning the degree of correctness of each alternative provided for an item as
a method of assessing the degree of knowledge or ability possessed by the
examinees. By assigning a probability estimate for each alternative to an item,
examinees can indicate degrees of partial knowledge they may have concerning
each alternative for an item.

To simplify this procedure for examinees, a number of methods have been
devised to aid examinees in assigning their subjective probabilities to the
alternatives. One method is to ask examinees to directly assign probabilities
from 0 to 1.00 to each alternative, with the restriction that the probabilities
assigned to all of the alternatives for each item sum to 1.00. Another method
instructs examinees to distribute 100 points among the alternatives for each
item. The distributed points are then converted to probabilities for scoring
purposes by dividing the points assigned to each alternative by 100. Some
investigators have used fewer points for distribution (Rippey, 1970) or symbols,
such as a certain number of stars, which are to be distributed among the
alternatives (deFinetti, 1965), but the concept is the same.
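
A minimal sketch of the point-distribution conversion follows. The function
name and the validation rules are hypothetical; the `total` parameter is an
assumption added so that the same code covers variants that distribute fewer
points.

```python
def points_to_probabilities(points, total=100):
    """Convert distributed points for one item into probabilities.

    Raises ValueError if any entry is negative or if the points do not
    sum to `total`, since the response format requires both conditions.
    """
    if any(p < 0 for p in points):
        raise ValueError("points must be non-negative")
    if sum(points) != total:
        raise ValueError(f"points must sum to {total}")
    return [p / total for p in points]


# Hypothetical response to a four-alternative item.
probabilities = points_to_probabilities([70, 20, 10, 0])
```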

Using these types of measurement procedures (sometimes called probabilistic
item formats or probabilistic response formats), an item scoring formula had to
be devised so that examinees' expected scores would be maximized only when they
responded according to their actual beliefs concerning the correctness of each
alternative. Item-scoring formulas which satisfy these conditions are called
reproducing scoring systems (RSS). Shuford, Albert, and Massengill (1966) and
deFinetti (1965) provide examples of several RSSs. The RSSs presented by these
two authors for use with multiple-choice items that have more than two
alternatives and only one correct answer are the following:
1. Spherical RSS

\[ \text{Item score} = \frac{p_c}{\sqrt{\sum_{k=1}^{m} p_k^2}} \tag{2} \]

where p_c = probability assigned to the correct alternative and
      p_k = probability assigned to alternative k (k = 1, 2, ..., m).

2. Quadratic RSS

\[ \text{Item score} = 2p_c - \sum_{k=1}^{m} p_k^2 \tag{3} \]

3. Truncated Logarithmic Scoring System

\[ \text{Item score} =
   \begin{cases}
     1 + \log(p_c), & .01 < p_c \le 1.00 \\
     -1,            & 0 \le p_c \le .01
   \end{cases} \tag{4} \]

or a modification of this scoring function:

\[ \text{Item score} =
   \begin{cases}
     [2 + \log(p_c)]/2, & .01 \le p_c \le 1.00 \\
     0,                 & 0 \le p_c < .01
   \end{cases} \tag{5} \]

The truncated logarithmic scoring system is technically not an RSS, but it does
have the properties of an RSS for probabilities between .027 and .973.
According to Shuford et al. (1966), when examinees believe that an alternative
has a probability of being the correct answer less than or equal to .027, their
score will be maximized by assigning a probability of zero to that alternative.
Alternatively, when examinees believe that an alternative has a probability
greater than or equal to .973, their expected score will be maximized by
assigning a probability of 1.00 to that alternative. Shuford et al. (1966)
stated that "for extreme values of (p_k), some information about the student's
degree-of-belief probabilities is lost, but from the point of view of
applications, the loss in accuracy is insignificant" (p. 137). Note also that
the truncated logarithmic scoring function is the only one of the scoring
formulas that is dependent only upon the probability assigned to the correct
alternative.
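
The following sketch implements the spherical, quadratic, and truncated
logarithmic item scores and an expected-score helper that can be used to probe
the reproducing property. It is an illustration under stated assumptions, not
the authors' code: the function names are hypothetical, the logarithm is
assumed to be base 10 (which is consistent with the scores of -1 and 0 at
p_c = .01), and alternatives are indexed from 0.

```python
import math


def spherical_score(probs, correct):
    """Spherical RSS: p_c over the Euclidean length of the response vector."""
    return probs[correct] / math.sqrt(sum(p * p for p in probs))


def quadratic_score(probs, correct):
    """Quadratic RSS: 2 * p_c minus the sum of squared probabilities."""
    return 2 * probs[correct] - sum(p * p for p in probs)


def truncated_log_score(probs, correct):
    """Truncated log score: 1 + log10(p_c) above .01, otherwise -1."""
    p_c = probs[correct]
    return 1 + math.log10(p_c) if p_c > 0.01 else -1.0


def modified_truncated_log_score(probs, correct):
    """Modified form: (2 + log10(p_c)) / 2 from .01 upward, otherwise 0."""
    p_c = probs[correct]
    return (2 + math.log10(p_c)) / 2 if p_c >= 0.01 else 0.0


def expected_score(score_fn, beliefs, reported):
    """Expected item score if alternative k is correct with prob. beliefs[k]."""
    return sum(q * score_fn(reported, k) for k, q in enumerate(beliefs))


# Hypothetical beliefs for a four-alternative item.
beliefs = [0.7, 0.2, 0.1, 0.0]
overstated = [1.0, 0.0, 0.0, 0.0]
honest_spherical = expected_score(spherical_score, beliefs, beliefs)
bluff_spherical = expected_score(spherical_score, beliefs, overstated)

# Two-alternative checks on either side of the .027 boundary.
honest_02 = expected_score(truncated_log_score, [0.98, 0.02], [0.98, 0.02])
extreme_02 = expected_score(truncated_log_score, [0.98, 0.02], [1.0, 0.0])
honest_03 = expected_score(truncated_log_score, [0.97, 0.03], [0.97, 0.03])
extreme_03 = expected_score(truncated_log_score, [0.97, 0.03], [1.0, 0.0])
```

Under these assumptions, honest reporting yields a higher expected spherical
score than overstating confidence. For the truncated log score, a belief of
.02 is better reported as 0 while a belief of .03 is better reported honestly,
in line with the .027 boundary cited from Shuford et al. (1966).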

Total test scores for examinees are obtained for all of the RSSs by simply
summing the individual item scores obtained using that particular scoring
formula. In addition to the conditions expressed above for an RSS, deFinetti
(1965) has stated that the validity of any reproducing scoring system also rests
upon the following assumptions:

1. The examinees are capable of assigning numerical values to their
subjective probabilities.

2. The examinees are trained in using the response format and understand
the scoring system to be used in scoring the items.

3. The examinees are motivated to do their best on the items.



Rippey (1968) reported results from several studies comparing test scores
obtained using the spherical RSS and the modification of the truncated
logarithmic scoring functions with test scores obtained by summing dichotomous
(0,1) item scores to conventional multiple-choice items. In general, he found
increases in Hoyt's reliability coefficient using a probabilistic response
format with RSSs under limited conditions. The probabilistic test format
produced increases in test reliability with undergraduate college students but
could not be used with fourth graders and produced no consistent increases in
reliability for tests given to high school freshmen or medical students. There
were also no consistent tendencies for one or the other of the scoring formulas
for the probabilistic response format to produce higher reliability
coefficients.

Rippey (1970) compared the reliabilities of five different methods of scoring
probabilistic item responses. Three of these methods were RSSs; the fourth
was simply the probability assigned to the correct answer, and the fifth was a
dichotomous scoring of the probabilistic responses, which resulted in an item
score of 1 if the probability assigned to the correct answer was greater than
the probability assigned to any other alternative and a score of 0 otherwise.
The three RSSs used were the modification of the truncated log scoring function,
the spherical RSS, and another RSS called the Euclidean RSS. An item score
using the Euclidean RSS is computed using the following equation:

\[ \text{Item score} = 1 - \text{[illegible in source]} \tag{6} \]

where p_k = probability assigned to alternative k (k = 1, 2, ..., m), and
x_k = criterion group mean probability assigned to alternative k.

Using Hoyt's reliability coefficient, Rippey found that the test scores
obtained by summing the probabilities assigned to the correct answer yielded
higher average reliability coefficients (.69) than any of the other scoring
methods and that the dichotomous scoring of the probabilistic responses yielded
the lowest average reliability of the five methods (.47), although it was not
much lower than those of the three RSSs (.49, .50, and .58).
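
The two non-RSS methods in Rippey's (1970) comparison are simple enough to
sketch directly. The function names are hypothetical, alternatives are indexed
from 0, and treating a tie for the highest probability as a score of 0 is an
assumption drawn from the phrase "greater than ... any other alternative."

```python
def probability_assigned_score(probs, correct):
    """Item score is the probability placed on the correct alternative."""
    return probs[correct]


def dichotomous_score(probs, correct):
    """1 if the correct alternative strictly outranks every other, else 0."""
    others = [p for k, p in enumerate(probs) if k != correct]
    return 1 if all(probs[correct] > p for p in others) else 0


def total_score(score_fn, responses, key):
    """Sum item scores over a test; key[j] is the correct index for item j."""
    return sum(score_fn(probs, c) for probs, c in zip(responses, key))
```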


In comparing two RSSs (quadratic and the modification of the truncated
logarithmic scoring functions) with conventional multiple-choice test scores,
Koehler (1971) found no significant differences between internal consistency
reliability coefficients for the test scores obtained using the two RSSs and the
test scores from the conventional multiple-choice items. He found evidence of
convergent validity for both the probabilistic and conventional item formats
and, on the basis of this evidence, suggested the use of conventional tests,
since they are "easier to administer, take less testing time, and do not require
the training of subjects in the intricacies of the confidence-marking
procedures" (p. 302). However, his conclusions must be viewed with caution,
since each of his tests consisted of only 10 items.

Extraneous Influences on the Use of
Subjective Probabilities with Multiple-Choice Items

Although Koehler's results may not be generalizable due to the small number
of items administered in each format, the use of probabilistic response formats
has been questioned for other reasons. Hansen (1971), [name illegible in source]
(1967), Echternacht, Boldt, and Sellman (1972), Koehler (1974), and Pugh and
Brunza (1974), along with several others, have investigated the possibility that
the increase in reliability demonstrated by probabilistic item formats is due to
the effect of a personality variable or response style variable rather than a
more accurate assessment of knowledge. This variable has been called risk
taking, certainty, confidence, and cautiousness. If it is the effect of this
response style variable that leads to increases in reliability for
probabilistic responding over conventional multiple-choice items, this effect
might also explain the fact that the probabilistic item format has not, in
general, led to increases in the validity of these test scores over that of test
scores obtained from conventional multiple-choice items.

Studies investigating the influence of these various personality variables
have shown mixed results. In studies where conventional multiple-choice item
scores and probabilistic item scores were obtained (Koehler, 1974; Echternacht,
Sellman, Boldt, & Young, 1971), the correlations between these scores
have been consistently high (.71 to .83 for the Koehler (1974) study and .89 to
.99 for the Echternacht et al. (1971) study). This suggests that a large
portion of the variation in the probabilistic test scores can be accounted for
by the conventional test scores. The question being posed, though, is whether
the variation in the probabilistic test scores that cannot be accounted for by
the conventional test scores is reliable variance due to increased accuracy of
assessment of knowledge or due to personality or response style variables.

To determine the influence of these personality factors, Koehler (1974)
embedded seven nonsense items in a 40-item vocabulary test and instructed
examinees that they were not to guess the answers to any items on the test. The
nonsense items were items with no correct alternatives. From responses to these
nonsense items, he calculated two confidence measures:

C_1 = proportion of nonsense items attempted under do-not-guess instructions,

and

\[ C_2 = \text{[illegible in source]} \tag{7} \]

where m = number of alternatives,
      n = number of nonsense items, and
      p_{ij} = probability assigned to alternative i on item j.

Since the nonsense items had no correct alternatives, an examinee's
responses to these items were a pure measure of a response style or personality
variable (confidence) that was influencing that examinee's responses. Responses
to these items were not due to any knowledge the examinee possessed, since there
were no correct answers to those items. The greater the deviation of these
indices from 0, the higher the level of confidence exhibited by the examinee.
Koehler found that both of these confidence indices were significantly
negatively correlated with three probabilistic test scores (spherical,
quadratic, and the modification of the truncated logarithmic scoring functions),
but not significantly correlated with the number-correct scores from the same
items. The number-correct scores also yielded a higher internal consistency
reliability coefficient than the three probabilistic scores (.85 versus .82,
.80, and .74). On the basis of these results, Koehler did not recommend the use
of probabilistic response formats, since "it would appear ... that confidence
responding methods produce variability in scores that cannot be attributed to
knowledge of subject matter" (p. 4).

Hansen (1971) obtained probabilistic test scores and scores on independent
measures of personality factors such as risk taking and test anxiety. He
developed a measure of certainty in responding to probabilistic response formats
which is essentially the average absolute deviation of a response vector to an
item from a response vector assigning equal probabilities to all alternatives.
Hansen's study showed that this certainty index was related to risk taking as
measured by the Kogan and Wallach Choice Dilemmas Questionnaire and
authoritarianism as measured by a version of the F-scale developed by Christie,
Havel, and Seidenberg (1958). However, the certainty index did not correlate
significantly with scores on a test anxiety questionnaire or scores on the
Gough-Sanford Rigidity Scale.
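
One literal reading of this certainty index can be sketched as below. The
function names are hypothetical, and averaging the item-level values across
items to obtain an examinee-level index is an assumption; the description
above specifies only the item-level deviation from a uniform response vector.

```python
def item_certainty(probs):
    """Mean absolute deviation of a response vector from the uniform vector."""
    m = len(probs)
    return sum(abs(p - 1 / m) for p in probs) / m


def mean_certainty(responses):
    """Assumed examinee-level index: average item certainty across items."""
    return sum(item_certainty(probs) for probs in responses) / len(responses)


# Hypothetical responses to three four-alternative items.
responses = [
    [0.25, 0.25, 0.25, 0.25],  # no certainty at all
    [1.00, 0.00, 0.00, 0.00],  # complete certainty
    [0.70, 0.20, 0.10, 0.00],
]
examinee_certainty = mean_certainty(responses)
```

The index is 0 for a uniform response and grows as probability is concentrated
on fewer alternatives, whether or not the favored alternative is correct. This
is why the index by itself cannot separate certainty from knowledge, the
problem taken up in the next paragraph.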

These results provide more information concerning the nature of the
response style, but there are problems with Hansen's (1971) certainty index,
which he attempts to alleviate but does not. The major problem with this index
is that it is not a pure measure of certainty. This certainty measure is
confounded by an examinee's knowledge concerning an item. Hansen attempted to
partial out examinees' knowledge by using their test scores as a predictor in a
regression equation to obtain predicted certainty scores. These predicted
certainty scores were then subtracted from the observed certainty scores to
obtain a certainty measure free of the influence of examinee knowledge.

Although the rationale is sound, Hansen did not accomplish what he set out
to do. The test score he used as a predictor was not a pure or even relatively
pure measure of knowledge. The test scores were probabilistic test scores
computed from the spherical RSS. This scoring system results in scores that
represent a confounding of certainty and knowledge. Therefore, by partialling
these probabilistic test scores from the certainty index, it is unclear exactly
what the residual certainty index represents, since both knowledge and some
certainty have been partialled out. Hansen's results were then based upon the
relationship of various personality variables with a certainty index confounded
with knowledge, and the relationship of these same personality variables with a
residual certainty index whose composition is somewhat ambiguous. Hansen's
results might best be viewed with caution.

Pugh and Brunza (1974) conducted a study similar to that of Hansen (1971),
except that they used a 24-item vocabulary test and scored it using the
probability assigned to the correct answer as the item score. They also obtained
scores on an independent nonprobabilistically scored vocabulary test, and
measures of risk taking, degree of external control, and cautiousness. They
followed Hansen's regression procedure to obtain a certainty measure free of the
confounding effects of knowledge and were more successful than Hansen. They
used the independent vocabulary test score as a predictor of the same certainty
index that Hansen used and then calculated a residual certainty index by
subtracting the predicted certainty score from the observed certainty score.
Since the independent vocabulary test was a relatively pure measure of
knowledge, partialling its effect from the observed certainty index resulted in
a residual certainty index that (1) was a measure of the certainty displayed in
responding to multiple-choice items in a probabilistic fashion and (2) was not
related to knowledge possessed by examinees concerning the items.
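
The regression-residual procedure can be sketched with a one-predictor
least-squares fit. This is an illustration only: the function name and the
data are hypothetical, and a simple linear regression with an intercept is
assumed, since the form of the regression equation is not given here.

```python
from statistics import mean


def residual_certainty(knowledge, certainty):
    """Observed certainty minus certainty predicted from a knowledge score.

    Fits certainty = intercept + slope * knowledge by ordinary least
    squares and returns one residual per examinee.
    """
    k_bar, c_bar = mean(knowledge), mean(certainty)
    sxx = sum((x - k_bar) ** 2 for x in knowledge)
    sxy = sum((x - k_bar) * (y - c_bar) for x, y in zip(knowledge, certainty))
    slope = sxy / sxx
    intercept = c_bar - slope * k_bar
    return [y - (intercept + slope * x) for x, y in zip(knowledge, certainty)]


# Hypothetical data for five examinees.
vocabulary_scores = [10, 14, 18, 20, 24]
certainty_index = [0.10, 0.18, 0.20, 0.30, 0.31]
residuals = residual_certainty(vocabulary_scores, certainty_index)
```

By construction the residuals are uncorrelated with the predictor, so the
interpretation of the residual index depends entirely on what the predictor
measures. That is the point of contrast between the two studies: a predictor
that itself mixes certainty and knowledge removes some certainty along with
the knowledge.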

Pugh and Brunza (1974) reported that this residual certainty measure was
not very reliable (.32 internal consistency reliability) and that it correlated
significantly with risk-taking scores obtained from the Kogan and Wallach Choice
Dilemmas Questionnaire but not with the measures of cautiousness and external
control they had obtained. Although this evidence of the influence of variables
other than knowledge on probabilistic test scores might serve as a deterrent to
the use of these scoring systems, Pugh and Brunza noted that "there is no
evidence in either study [Pugh & Brunza, 1974, or Hansen, 1971] that these
factors are more operative than in traditional tests" (p. 6).

Echternacht et al. (1971) scored answer sheets of daily quizzes obtained
from two Air Force training courses using a truncated logarithmic scoring
function and number correct. They found that using the number-correct score, the
shift of the trainees, and a number of personality variables such as test
anxiety, risk taking, and rigidity as predictors of the probabilistic test
scores did not account for significantly more of the variation in the
probabilistic test scores than was accounted for when using only number-correct
scores and shift of the trainees as predictors. This is evidence that the
personality variables did not operate to a greater extent in a probabilistic
testing situation than in a conventional multiple-choice testing situation.

Thus, these studies show some relationship of probabilistic test scores to
personality variables (primarily risk-taking tendencies), but they also show
that these influences do not seem to be greater in probabilistic testing
situations than in conventional testing situations.

Use of Alternate Item Types

The research reviewed above relied on the multiple-choice item type and
varied the method of responding to that type of item; however, some researchers
have advocated the use of entirely different item types, such as free-response
items, to aid in the assessment of partial knowledge. Some of these alternate
item types avoid many of the problems inherent in multiple-choice items but are
subject to problems of their own. For example, the free-response item type
avoids the problem of random guessing among a number of alternatives and has the
potential to provide a large amount of information concerning what the examinee
does or does not know, but it is also more time-consuming to administer and
score, and may cover much less material than is possible with a multiple-choice
format. Consequently, if there are any time constraints on testing, fewer items
can be administered. Practical problems with scoring many of these alternate
item types have prevented widespread use of several of them.

Purpose

Although comparisons of the psychometric properties of multiple-choice
items with several alternate item types are planned, the present research
focused on comparisons of the probabilistic response format. This study has
attempted to answer the following questions:

1. Does a personality variable such as certainty affect probabilistic test
scores on an ability test to a greater degree than it affects conventional
test scores on the same ability test?

2. If the effect of a personality variable can be discounted, what types
of scoring systems are best for multiple-choice items on an ability
test requiring probabilistic responses?

Method 

Test Items 

Thirty multiple-choice analogy items were chosen from a pool of items
obtained from Educational Testing Service (ETS) containing former SCAT and STEP
items. Each item consisted of an item stem and four alternatives. The pool of
items had been parameterized by ETS on groups of high school students using the
computer program LOGIST (Wood, Wingersky, & Lord, 1976) with a three-parameter
logistic model, resulting in item response theory discrimination, difficulty,
and guessing parameters calculated from large numbers of examinees for each
item. The 30 items were chosen from a pool of approximately 300 analogy items
to represent a uniform range of discrimination and difficulty parameters. The
parameters for the chosen items are in Appendix Table A. The item
discrimination parameters ranged from approximately a = .6 to a = 1.4, with a
mean of .975 and a standard deviation of .244, while the difficulty parameters
ranged from approximately b = -.5 to b = 2.5, with a mean of .961 and a standard
deviation of .867. The range of difficulty parameters was not chosen to be
symmetric about zero because the available examinees constituted a more select
group than the group whose responses were used to parameterize the items. The
guessing parameters for these items ranged from c = .09 to c = .38, with a mean
of .20 and a standard deviation of .06.

Test Administration 

The 30 multiple-choice analogy items chosen were then administered to 299
psychology and biology undergraduate students at the University of Minnesota
during the 1979-1980 academic year. Students received two points toward their
course grade (either introductory psychology or biology) for their
participation. Items were administered by computer to permit checking of
responses to be sure that item response instructions were carefully followed.

The examinees were instructed to respond to each item by assigning a
probability to each of the four alternatives. This probability was to correspond
to the examinee's belief in the correctness of each alternative, with the
additional restriction that the probabilities assigned to all of the
alternatives for an item sum to one. Specifically, for each item, the examinees
were asked to distribute 100 points among the four alternatives provided for
each item according to their belief as to whether or not the alternative was the
correct alternative for that item. The total number of points assigned to all of
the alternatives for an item had to equal 100. Since the tests were computer
administered, item responses were summed immediately to ensure that the
responses to the alternatives did indeed sum to 100 (sums of 99 and 101 were
also considered valid to allow for rounding). The points assigned to each
alternative were then converted into probabilities by dividing the response to
each alternative by 100.
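
The response check and conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not the program used in the study: the function name `points_to_probabilities` and the `tolerance` parameter are hypothetical, and returning `None` for an invalid allocation simply stands in for the re-prompting the text describes.

```python
def points_to_probabilities(points, tolerance=1):
    """Convert a 100-point allocation over alternatives into probabilities.

    `tolerance` mirrors the rule that sums of 99 and 101 were accepted.
    Returns None for an invalid allocation so the caller can re-prompt.
    (Function name and return convention are illustrative assumptions.)
    """
    if any(p < 0 for p in points):
        return None
    if abs(sum(points) - 100) > tolerance:
        return None
    # The text describes dividing each response by 100; no renormalization
    # of sums of 99 or 101 is mentioned, so none is applied here.
    return [p / 100 for p in points]


print(points_to_probabilities([20, 60, 20, 0]))  # [0.2, 0.6, 0.2, 0.0]
print(points_to_probabilities([33, 33, 33, 0]))  # accepted: the sum is 99
print(points_to_probabilities([50, 30, 10, 0]))  # None: the sum is 90
```

Note that an allocation summing to 99 or 101 yields probabilities that sum to .99 or 1.01 under this rule.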

To ensure that the examinees understood both how to use the computer and
how to respond to the multiple-choice items in a probabilistic fashion, a
detailed set of instructions preceded each test (see Appendix Table B). If an
examinee responded incorrectly to an instruction, the computer would display an
appropriate error message on the screen and the examinee would have to respond
correctly before proceeding to the next screen. If an examinee again responded
inappropriately to an instruction, a test proctor was called by the computer to
provide additional help to the examinee in understanding the instructions.
Several examples and explanations of methods of responding to probabilistic
items were provided. Examinees, with few exceptions, did not have any
difficulty understanding how to respond to the items. If, in responding to an
item, an examinee's responses did not sum to 99, 100, or 101, the examinee was
immediately asked to reenter his/her responses until an appropriate sum was
entered.



Item Scoring 

The item responses obtained from these 299 examinees were then scored using
five different scoring formulas to determine which of these scoring formulas
yielded the most reliable and valid scores. The five different scoring formulas
used were:

1. The probability assigned to the correct alternative by the examinee 
(PACA) was used as the item score. This scoring formula yields scores 
that range from 0 to 1.00. 

2. The second type of item score (AIKEN) was computed from a variation of a
scoring formula developed by Aiken (1970), which is a function of the
absolute difference between the correct response vector for an item and
the obtained response vector:

\[ \text{Item score} = 1 - \frac{D}{D_{\max}} \tag{8} \]

where

\[ D = \sum_{i=1}^{m} \lvert p_{ai} - p_{ei} \rvert , \tag{9} \]

$m$ = number of alternatives,
$p_{ai}$ = probability assigned to alternative $i$ by the examinee,
$p_{ei}$ = expected probability for alternative $i$, and
$D_{\max}$ = maximum value of $D$, which was 2.00 for all of these items.

Each correct response vector would contain three 0's and one 1, while
the obtained response vector would contain four probabilities that sum
to 1.00. For example, for an item where the second alternative was the
correct alternative, the correct response vector would be 0, 1.00, 0, 0.
A response vector that might have been obtained for this item is .20,
.60, .20, 0. For this obtained response vector the item score would be
computed as follows:

\[ \text{Item score} = 1 - \frac{|0 - .20| + |1.00 - .60| + |0 - .20| + |0 - 0|}{2.00} = 1 - \frac{.80}{2.00} = .60 \tag{10} \]

This scoring formula also yields scores that range from 0 to 1.00.

3. The quadratic RSS (QUAD) is defined by Equation 3. This scoring formula
yields scores that range from -1.00 to 1.00.

4. The spherical RSS (SPHER) is defined in Equation 2. This scoring formula
yields scores that range from 0 to 1.00.

5. A modification of the truncated logarithmic scoring function (TLOG).
This scoring formula is a good approximation to the logarithmic RSS. It
is a very good approximation throughout most of the possible score
range, and is defined by Equation 5. This scoring formula yields scores
from 0 to 1.00. The actual formula used here to obtain scores via a
truncated logarithmic scoring function utilizes a scaling factor of 5
rather than the usual scaling factor of 1 or 2. It was necessary to
increase this scaling factor to maintain a logical progression of
scores, since the probability assigned to the correct answer for some
items was as low as .01. Since the natural log of .01 is -4.6052, the
scaling factor had to be 5 (actually only some number slightly higher than
4.6052) in order that the scores progress in an orderly fashion from 0
to 1.00 according to the probability assigned to the correct answer.
This alleviated the problem of assigning negative scores to examinees
who had assigned very small probabilities to the correct answer while
assigning a score of 0 (a higher score) to examinees who had assigned a
zero probability to the correct answer. The actual TLOG scoring formula
used is Equation 11:

\[
\text{Item score} =
\begin{cases}
\dfrac{5 + \log(p_c)}{5}, & .01 \le p_c \le 1.00 \\[4pt]
0, & 0 \le p_c < .01
\end{cases}
\tag{11}
\]

Total test scores for all of the scoring methods were obtained by summing the
item scores over the 30 items.
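
Three of the item-scoring rules listed above (PACA, AIKEN, and TLOG) are fully specified in this section and can be sketched in Python as follows. This is an illustrative sketch rather than the scoring program used in the study: the function names, the 0-based `correct` index, and the `total_score` helper are assumptions introduced here. QUAD and SPHER are omitted because Equations 2 and 3 are not reproduced in this section.

```python
import math


def paca(probs, correct):
    """PACA: probability assigned to the correct alternative."""
    return probs[correct]


def aiken(probs, correct, d_max=2.0):
    """AIKEN: 1 - D / D_max, with D the summed absolute differences between
    the obtained response vector and the 0/1 correct response vector."""
    d = sum(abs(p - (1.0 if i == correct else 0.0)) for i, p in enumerate(probs))
    return 1.0 - d / d_max


def tlog(probs, correct, scale=5.0, floor=0.01):
    """TLOG as in Equation 11: (5 + ln p_c) / 5, truncated to 0 below .01."""
    p = probs[correct]
    if p < floor:
        return 0.0
    return (scale + math.log(p)) / scale


def total_score(item_responses, key, rule):
    """Sum item scores over all items (hypothetical helper)."""
    return sum(rule(probs, c) for probs, c in zip(item_responses, key))


example = [0.20, 0.60, 0.20, 0.0]   # the worked example from the text
print(round(paca(example, 1), 2))   # 0.6
print(round(aiken(example, 1), 2))  # 0.6
print(round(tlog(example, 1), 3))   # 0.898
```

With the scaling factor of 5, a probability of .01 on the correct answer scores about .08 under TLOG, which stays above the 0 given for a zero probability, as the text requires.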

Determining the Effect of Certainty 

To determine the effect of an examinee's certainty or propensity to take
risks when responding to probabilistic items, Hansen's (1971) certainty index
was computed for each examinee from the following quantities:

n = number of items in test,
m_j = number of alternatives for item j, and
p_ij = probability assigned to alternative i of item j.

This certainty index is a function of the absolute difference between the
probabilities assigned to the four alternatives and .25, averaged over items.
Since the probabilities assigned to each alternative are dependent upon both an
examinee's knowledge and his/her level of certainty, this certainty index is not
a "pure" measure of certainty, but is confounded with knowledge about the item.

To determine the effect of this response style variable, it was first
necessary to obtain a "pure" measure of certainty. This relatively pure measure
of certainty was obtained by scoring the probabilistic responses dichotomously
and then partialling the effect of this knowledge variable out of the certainty
indices. A dichotomous test score was obtained from the probabilistic responses
by making the assumption that under conventional "choose-the-correct-answer"
instructions, examinees would choose the alternative to which they assigned the
highest probability under the probabilistic instructions. Thus, for each item,
the alternative assigned the highest probability by the examinee was chosen as
the alternative the examinee would have chosen under traditional multiple-choice
instructions. A score of 1 was assigned if that alternative was the correct
answer and a score of 0 was assigned otherwise. When more than one alternative
was assigned the highest probability, one of those alternatives was randomly
chosen as the alternative the examinee would have chosen. This procedure
attempted to simulate the decision-making process of an examinee in choosing a
correct answer to an item.
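
The certainty index and the dichotomous rescoring described above can be sketched as follows. Both functions are illustrative assumptions. In particular, the exact form and normalization of Hansen's (1971) index are not shown in this section, so `certainty_index` follows only the verbal description (absolute differences between the assigned probabilities and 1/m, averaged over items). The function names and the 0-based answer key are hypothetical.

```python
import random


def certainty_index(responses):
    """Mean over items of the summed absolute deviations from chance (1/m).

    ASSUMPTION: reconstructed from the verbal description only; the
    normalization used by Hansen (1971) may differ.
    """
    total = 0.0
    for probs in responses:
        chance = 1.0 / len(probs)  # .25 for four alternatives
        total += sum(abs(p - chance) for p in probs)
    return total / len(responses)


def dichotomous_item_score(probs, correct, rng=random):
    """Score 1 if the alternative given the highest probability is correct.

    Ties for the highest probability are broken at random, as in the text.
    """
    top = max(probs)
    tied = [i for i, p in enumerate(probs) if p == top]
    return 1 if rng.choice(tied) == correct else 0


# Hypothetical responses of one examinee to two items (not study data).
responses = [[0.20, 0.60, 0.20, 0.0], [0.25, 0.25, 0.25, 0.25]]
key = [1, 2]  # 0-based index of the correct alternative for each item

print(round(certainty_index(responses), 2))          # 0.35
print(dichotomous_item_score(responses[0], key[0]))  # 1
```

Under this form, a uniform response (.25 to every alternative) contributes 0 to the index and a response placing all 100 points on one alternative contributes 1.5, regardless of whether that alternative is correct.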

This dichotomous test score was used in a regression equation to predict
the certainty index. The predicted certainty index was then subtracted from the
actual certainty index to obtain a residual certainty index. This residual
certainty index constituted a "pure" measure of certainty. This pure certainty
index was partialled out of the probabilistic test scores using the same method
as that used to partial the dichotomous test scores out of the original
certainty index. That is, the pure certainty index was used to predict the
probabilistic test score, and the predicted probabilistic test score was then
subtracted from the probabilistic test score to obtain a residual probabilistic
test score that was unassociated with the pure certainty index.
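
The two partialling steps amount to taking residuals from simple linear regressions. The sketch below illustrates this with ordinary least squares on made-up numbers. The helper names `residualize` and `pearson_r` and the five-examinee data are hypothetical, and the study's actual regression software is not described in the text.

```python
def residualize(y, x):
    """Return the residuals of y after a simple least-squares regression on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def pearson_r(a, b):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    var_a = sum((ai - ma) ** 2 for ai in a)
    var_b = sum((bi - mb) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical scores for five examinees (not data from the study).
dichotomous = [10, 15, 20, 25, 30]
certainty = [0.5, 0.9, 0.8, 1.2, 1.1]
probabilistic = [9.0, 14.5, 17.0, 24.0, 26.5]

# Step 1: remove knowledge (dichotomous score) from the certainty index.
residual_certainty = residualize(certainty, dichotomous)
# Step 2: remove the "pure" certainty index from the probabilistic score.
residual_score = residualize(probabilistic, residual_certainty)

print(abs(pearson_r(residual_certainty, dichotomous)) < 1e-9)     # True
print(abs(pearson_r(residual_score, residual_certainty)) < 1e-9)  # True
```

By construction, each residual is uncorrelated with the variable that was partialled out of it.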

As a result of these partialling operations, the following measures were
available for each of the five scoring methods:

1. Probabilistic test score. This score represents a confounding of
knowledge and certainty.

2. Dichotomous test score. This score represents a pure knowledge index
and is the dichotomous scoring of the probabilistic responses.

3. Residual score. This score is the probabilistic test score with the
pure certainty index partialled out, and thus represents the pure
knowledge component of the probabilistic scores.

4. Certainty index. This measure represents a confounding of knowledge and
certainty.

5. Residual certainty index. This measure is the certainty index with the
pure knowledge index (the dichotomous test score) partialled out and
thus represents a pure certainty index.

Evaluative Criteria 

Reliability and validity coefficients were computed for both the
probabilistic and the residual test scores. The reliability coefficients were
internal consistency reliability coefficients calculated using coefficient
alpha. The validity coefficients were the correlations between test score and
reported grade-point average. For each of the five scoring methods used, the
validity and reliability of the residual scores were compared with those of the
original probabilistic test scores. If there was any difference between the
validities and the reliabilities of the probabilistic and the residual scores,
it could be attributed to the effect of certainty in responding, since the only
difference between the two scores was that the effect of certainty had been
removed from the residual scores.

Factor analyses of the item scores (both probabilistic and residual) for
each of the five scoring formulas were performed using a principal axis factor
extraction method. The number of factors extracted for each of the scoring
formulas was determined through parallel analyses (Horn, 1965) performed
separately for each scoring formula, using randomly generated data with the same
numbers of items and examinees as the real data and with item difficulties
(proportion correct) equated with the real data. Coefficients of congruence and
correlations between factor loadings for each of the five scoring formulas were
computed.



Results 

Score Intercorrelations

Correlations between probabilistic test scores, residual test scores,
dichotomous scores, the certainty index, and the residual certainty index for
each of the scoring formulas are presented in Table 1. Since the AIKEN scoring
formula resulted in item scores and correlations that were identical to those of
the PACA scoring formula, only the PACA results are reported.
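
The identity of the AIKEN and PACA item scores follows from the form of the AIKEN formula. The short derivation below assumes that the assigned probabilities sum to exactly 1 and that the correct response vector has a 1 for the correct alternative and 0 elsewhere. The subscript $c$ for the correct alternative and the symbol $p_i$ for the probability assigned to alternative $i$ are introduced here for illustration.

\[
D = \lvert p_c - 1 \rvert + \sum_{i \ne c} \lvert p_i - 0 \rvert
  = (1 - p_c) + (1 - p_c) = 2(1 - p_c),
\qquad
1 - \frac{D}{D_{\max}} = 1 - \frac{2(1 - p_c)}{2.00} = p_c .
\]

For a response whose points summed to 99 or 101, the same algebra gives an AIKEN score that differs from the PACA score by .005, so the identity is exact only for responses summing to 100.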

As expected, due to the partialling procedure, the correlation between the
residual certainty index and the dichotomous score, and the correlation between
the residual certainty index and the residual score, were both zero for all
scoring methods. The correlation between the original certainty index and the
dichotomous score (.71), and the correlation between the original certainty
index and the residual certainty index (.71), were exactly the same for all four
scoring formulas. This is due to the fact that the three indices (the original
certainty index, the residual certainty index, and the dichotomous score) do not
change with the particular scoring formula used; they are constant for each
individual across scoring methods. These two significant correlations, along
with the significant correlations exhibited for each of the scoring formulas
between the certainty index and the residual score (.65, .67, .62, and .68 for
QUAD, SPHER, TLOG, and PACA, respectively), show that the original certainty
index is indeed related to both "knowledge" as measured by traditional
multiple-choice tests (the dichotomous scores) and "certainty" unconfounded with
"knowledge" (the residual certainty index).

Table 1

Intercorrelations of Scores for Multiple-Choice Items with a
Probabilistic Response Format Scored by Four Scoring Methods

Scoring Method        Probabi-   Dichot-               Residual    Residual
and Score             listic     omous     Certainty   Certainty   Score

QUAD (lower triangle) and SPHER (upper triangle)
Probabilistic            --       .94**      .64**      -.04        1.00**
Dichotomous             .91**      --        .71**       .00         .94**
Certainty               .56**     .71**       --         .71**       .67**
Residual Certainty     -.12*      .00        .71**        --        -.00
Residual Score          .99**     .92**      .65**       .00          --

TLOG (lower triangle) and PACA (upper triangle)
Probabilistic            --       .93**      .83**       .24**       .97**
Dichotomous             .85**      --        .71**       .00         .96**
Certainty               .43**     .71**       --         .71**       .68**
Residual Certainty     -.25**     .00        .71**        --        -.00
Residual Score          .97**     .88**      .62**       .00          --

*p < .05
**p < .01

The correlations between the probabilistic test scores and the dichotomous
test scores were .91, .94, .85, and .93 for the QUAD, SPHER, TLOG, and PACA
scoring methods, respectively. Using approximate significance tests for
correlations obtained from dependent samples (Johnson & Jackson, 1959, pp.
352-358), all of the pairwise comparisons among these correlations were
significantly different from each other at the .05 level of significance.
Practically, the only correlation of these four that appears different from the
others is that of TLOG (.85 as opposed to .91, .94, and .93 for the other
scoring methods). Squaring these four correlations yields the proportion of
variance in the probabilistic test scores accounted for by the dichotomous test
scores. The squared correlations are .83, .88, .72, and .86 for the QUAD, SPHER,
TLOG, and PACA scoring procedures.

The correlations between the residual certainty index (the "pure" certainty
measure) and the probabilistic test scores were -.12, -.04, -.25, and .24 for
the QUAD, SPHER, TLOG, and PACA scoring formulas, respectively. The
correlations for the QUAD and SPHER scoring formulas were not significantly
different from zero at the .01 level of significance and thus do not account for
significant amounts of the variance of the probabilistic test scores. Squaring
the correlations that are significantly different from zero results in squared
correlations of .06 for both the TLOG and PACA scoring formulas. Thus, certainty
as measured by the residual certainty index accounts for no more than 6% of the
variance of any of the probabilistic test scores.

The correlations in Table 1 between the probabilistic test scores and the
residual scores are very high for all four scoring formulas (.99, 1.00, .97, and
.97 for QUAD, SPHER, TLOG, and PACA, respectively). These correlations are
highest (.99 and 1.00) for the QUAD and SPHER scoring formulas, whose
correlations between the probabilistic test score and residual certainty index
were not significantly different from zero (-.12 and -.04); these correlations
squared (.98 and 1.00) show that almost all of the variance in the QUAD
probabilistic test scores, and all of the variance of the SPHER probabilistic
test scores, is accounted for by the residual scores (representing "knowledge"
concerning the items).

The correlations between the dichotomous test scores and the residual
scores are high and significantly different from zero for all of the scoring
formulas (.92, .94, .88, and .96 for QUAD, SPHER, TLOG, and PACA scoring
formulas, respectively). This result is expected, since both the residual scores
and the dichotomous scores are relatively pure measures of knowledge.

It was also expected that the correlations between the original certainty
index and the probabilistic test scores for the various scoring methods would be
greater than the correlations between this certainty index and the dichotomous
scores, since the probabilistic test scores and the original certainty index
both represent a confounding of certainty and knowledge, while the dichotomous
scores are a measure of knowledge less confounded by certainty. This occurred
only for the PACA scoring method, which was the only scoring method that was not
an RSS. The correlation between the certainty index and probabilistic test
score was significantly greater than the correlation between the dichotomous
score and the certainty index (.83 vs. .71) for the PACA scoring formula, and
was significantly less (using the dependent samples test of significance for
correlations and a .05 level of significance) than .71 (.56, .64, and .43) for
the other three scoring formulas.

Validity and Reliability

Table 2 shows the validity and internal consistency reliability
coefficients for the probabilistic test scores obtained from the various methods
of scoring the multiple-choice items with a probabilistic response format. The
validity coefficients were all significantly different from zero but were not
significantly different from each other, using a dependent samples test of
significance for correlation coefficients (Johnson & Jackson, 1959, pp. 352-358)
and maintaining the experimentwise error at a .01 alpha level.

The reliability coefficients were all significantly different from zero and
significantly different from each other (using the Pitman procedure described in
Feldt, 1980, for testing the significance of differences between coefficient
alpha for dependent samples using a .01 significance level). The PACA scoring
method yielded the highest internal consistency reliability (.91), followed by
SPHER (.88), QUAD (.87), and TLOG (.84).



Table 2

Validity Correlations of Test Scores with
Reported GPA and Alpha Internal Consistency
Reliability Coefficients for Multiple-Choice Items
with a Probabilistic Response Format (N=299)

Scoring                   Validity            Reliability
Method                    r       p*          alpha    p*

Unpartialled Scores
  Quadratic RSS          .18     <.001        .87     <.001
  Spherical RSS          .18     <.001        .88     <.001
  Truncated Log RSS      .18     <.001        .84     <.001
  PACA                   .17     <.001        .91     <.001

Residual Scores
  Quadratic RSS          .13      .011        .87     <.001
  Spherical RSS          .13      .011        .88     <.001
  Truncated Log RSS      .14      .006        .84     <.001
  PACA                   .12      .017        .91     <.001

*Probability of rejecting null hypothesis of no
significant difference from zero.



Validity and internal consistency reliability coefficients for the residual
scores are also shown in Table 2. The reliability coefficients for the residual
scores are exactly the same as the reliability coefficients for the
probabilistic test scores. The validity coefficients for the residual scores
were all significantly different from zero but not from each other (.01
significance level), and these validity coefficients were significantly lower
(p < .05) for the residual scores than for the unpartialled probabilistic test
scores (.18 vs. .13 for QUAD, .18 vs. .13 for SPHER, .18 vs. .14 for TLOG, and
.17 vs. .12 for PACA). This decrease in the magnitude of the validity
coefficients of the residual scores is not due to a restriction in range
problem, since the range of scores for the probabilistic test scores was very
similar to that of the residual scores, as is shown in Table 3.



Table 3

Range of Scores for Probabilistic and
Residual Test Scores

Scoring
Method            Probabilistic    Residual

Quadratic             27.21          27.30
Spherical             16.57          16.56
Truncated Log         13.14          12.74
PACA                  20.69          20.10

Factor Analysis of Probabilistic Test Scores

Factor analyses of the unpartialled probabilistic and residual test scores
yielded virtually identical results; therefore, only the results of the factor
analyses of the probabilistic test scores are reported here.



Figures 1a to 1d show the results of the parallel analyses performed for
each of the scoring methods (numerical data are in Appendix Table C). The
eigenvalues obtained from the principal axes factor analysis of the random data
were all low; as expected, no factor accounted for significantly more variation
in the items than any other factor. In comparing the eigenvalues of the actual
data with those from the random data, it is clear that one strong factor is
present for all of the scoring methods. A second factor also appears for each of
the scoring methods with eigenvalues greater than that of the second factor for
the random data, but the eigenvalues for the second factors of the random and
actual data are so close that the second factor (and third factor for TLOG) for
the actual data can be considered to be the same strength as a random factor.
On the basis of these results, one-factor principal axis factor solutions were
obtained for each of the scoring methods and are shown in Table 4.

The factor loadings in Table 4 are positive and fairly high for all items
and all scoring formulas, indicating a global factor for each of the scoring
methods. The magnitudes of the eigenvalues show that this factor accounted for
more of the variance of the item responses for the PACA scoring formula (26%)
than for any of the other scoring formulas (19.9%, 20.9%, and 17.4% for the
QUAD, SPHER, and TLOG scoring formulas).

The correlations between factor loadings across the 30 items for the various
scoring methods are presented in the lower left triangle of Table 5, while
coefficients of congruence are reported in the upper right triangle of Table 5.
The coefficients of congruence are at the maximum of 1.00 for all of the pairs
of factor loadings and the correlations among all of the factor loadings are
very high, except for the correlation between the factor loadings for the PACA
and TLOG scoring methods, which was only .80. The fact that all of the
coefficients of congruence are equal to the maximum value for this index is due
to the dependence of this index upon the magnitude and sign of the factor
loadings. Gorsuch (1974, p. 254) notes that this index will be high for factors
whose loadings are approximately the same size even if the pattern of loadings
for the two factors is not the same.
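
The difference between the two indices can be illustrated with a small Python sketch. It assumes the usual definition of the coefficient of congruence (the sum of cross-products of the loadings divided by the square root of the product of their sums of squares), and the two sets of loadings are hypothetical values chosen to make the point, not values from Table 4.

```python
import math


def congruence(x, y):
    """Coefficient of congruence: uncentered cosine between loading vectors."""
    cross = sum(a * b for a, b in zip(x, y))
    return cross / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))


def correlation(x, y):
    """Pearson correlation: the same index after centering each vector."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return congruence([a - mx for a in x], [b - my for b in y])


# Hypothetical loadings: similar in size and sign, opposite in pattern.
loadings_1 = [0.40, 0.45, 0.50, 0.55]
loadings_2 = [0.55, 0.50, 0.45, 0.40]

print(round(congruence(loadings_1, loadings_2), 2))   # 0.97
print(round(correlation(loadings_1, loadings_2), 2))  # -1.0
```

Because the congruence coefficient is not centered, loadings that are all positive and of similar magnitude yield a value near 1.00 even when their ordering across items is reversed, whereas the correlation reflects only the pattern.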



Figure 1
Eigenvalues from Parallel Analysis of Random Data
and Actual Data for QUAD, SPHER, PACA, and TLOG Scoring Methods
[Figure not reproduced. Panels: (a) QUAD, (b) SPHER, (c) PACA, (d) TLOG]

Table 4

Factor Loadings on the First Factor
for Multiple-Choice Items with a
Probabilistic Response Format

Item                  Scoring Method
Number        QUAD     SPHER    TLOG     PACA

 1            .418     .433     .382     .490
 2            .446     .438     .412     .493
 3            .?39     .456     .409     .476
 4            .439     .435     .358     .526
 5            .233     .264     .165     .347
 6            .429     .443     .396     .528
 7            .932     .158     .316     .412
 8            .424     .428     .413     .505
 9            .324     .354     .259     .469
10            .426     .4?4     .391     .500
11            .383     .377     .355     .445
12            .538     .529     .509     .585
13            .513     .513     .519     .566
14            .444     .441     .422     .483
15            .368     .384     .341     .414
16            .465     .512     .469     .543
17            .543     .537     .487     .586
18            .505     .484     .546     .509
19            .316     .338     .244     .445
20            .490     .492      ?       .507
21            .552     .552     .491     .597
22            .544     .571     .518     .624
23            .498     .503     .463     .527
24            .472     .505     .394     .553
25            .400     .422     .380     .466
26            .437     .466     .406     .517
27            .514     .505     .508     .520
28            .524     .515     .473     .571
29            .406     .423     .349     .488
30            .387     .453     .370     .514

Eigenvalue    5.98     6.27     5.22     7.81

Table 5

Correlations (Lower Triangle) and Coefficients
of Congruence (Upper Triangle) Between
Factor Loadings Obtained for Four Scoring Methods

Scoring
Method      QUAD     SPHER    TLOG     PACA

QUAD         --      1.00     1.00     1.00
SPHER       .97       --      1.00     1.00
TLOG        .95      .92       --      1.00
PACA        .90      .93      .80       --

The evidence concerning the effect of examinee certainty on probabilistic
test scores suggests that certainty as a response style variable has a small,
almost negligible effect on the probabilistic test scores obtained in this
study. The reliability coefficients for the five scoring methods were exactly
the same for the probabilistic and residual test scores, indicating that the
certainty variable was not contributing reliable variance to the probabilistic
test scores and was not artificially increasing the reliability coefficients.
The factor structures of the probabilistic test scores and the residual test
scores were also identical. The factor structure and internal consistency
reliability data (which are both based upon the interitem correlations for each
scoring method) indicate no effect of the certainty variable on probabilistic
test scores above and beyond the effect on the residual test scores (i.e., the
probabilistic test scores with the "pure" certainty index partialled out). This
lack of effect is demonstrated by the extremely high correlations between the
scores derived assuming conventional multiple-choice instructions (the
dichotomous score) and the probabilistic test scores for all of the scoring
methods studied, and by the extremely low correlations between the "pure"
certainty index (the residual certainty index) and the probabilistic test scores
for each scoring method. Since the dichotomous test scores simulate testing
conditions under conventional multiple-choice instructions to choose the one
correct answer, these high correlations suggest that the greatest portion of the
variability in the probabilistic test scores for all of the scoring formulas is
not different from that present in scores obtained with traditional
multiple-choice tests.

The validity coefficients did show an effect of the certainty index on the
probabilistic test scores. The significant decrease in the validity coefficients
which occurs when the "pure" certainty index is partialled from the
probabilistic test scores is evidence of some effect of the certainty variable
on the probabilistic test scores. However, even though the decrease was
significant for all of the scoring formulas, the practical difference was small.
The validity coefficients of the probabilistic test scores were all low
initially, since the reported GPA criterion is a complex variable not easily
predicted by a single factor of analogical reasoning. Although reported GPA
might not have been a true reflection of actual GPA (although Thompson and
Weiss, 1980, data show a correlation of .59 between the two), this invalidity
should not have affected the comparisons made in this study. Additional research
utilizing different criterion measures is recommended to further investigate the
generality of the results obtained here.
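
The partialling operation referred to above can be illustrated with a minimal sketch. It assumes that the certainty index is removed from the test scores by simple least-squares regression and that the residual scores are then correlated with the criterion; the data and all function names are hypothetical and serve only to show the computation.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def pearson(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def residualize(ys, xs):
    """Residuals of ys after least-squares regression of ys on xs."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

# Hypothetical values for six examinees (not data from this study).
scores = [12.0, 15.5, 9.0, 20.0, 17.5, 11.0]    # probabilistic test scores
certainty = [0.2, 0.6, 0.1, 0.9, 0.5, 0.4]      # "pure" certainty index
criterion = [2.8, 3.1, 2.5, 3.6, 3.4, 2.7]      # reported GPA

residual_scores = residualize(scores, certainty)
validity_raw = pearson(scores, criterion)
validity_partialled = pearson(residual_scores, criterion)
print(round(validity_raw, 3), round(validity_partialled, 3))
```

By construction the residual scores are uncorrelated with the certainty index, so any drop from the first coefficient to the second reflects the part of the validity that was carried by certainty.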


Other than the small effect of the certainty variable on the validity
coefficients for each of the scoring formulas, there appears to be no effect of
the certainty variable on the probabilistic test scores. However, since not all
of the variance in the probabilistic test scores can be accounted for by the
"pure" knowledge and certainty indices, there may be some other response style
variable that exerts an influence upon the probabilistic test scores. This
influence would have to be extremely small, though, since the knowledge and
certainty indices accounted for 88%, [?]4%, 78%, and 92% of the variance in the
scores obtained from the spherical, quadratic, truncated log, and PACA scoring
formulas, respectively.

Choice among Scoring Methods

The choice among the five scoring methods must be made on the basis of the
validity coefficients, the reliability coefficients, and the factor analysis
results. Since there were no significant differences between any of the validity
coefficients, these coefficients do not provide support for any one scoring
method. In terms of the reliability coefficients, the PACA (and its equivalent
AIKEN) scoring formula yielded scores having the highest reliability
coefficients of all of the scoring methods.

The dependence of both the internal consistency reliability coefficient and
the one-factor solution on the interitem correlations suggests that scores from
the scoring formulas with the highest reliability coefficients would also have
the strongest first factors, and this is exactly what occurred in this study.
Hypothesizing that the factor extracted represents verbal ability, it is
desirable that this factor account for as large a proportion of each item's
variance as possible. The factor contribution of this first factor was greater
for the two scoring methods that are not reproducing scoring systems (PACA and
AIKEN) than for the three scoring methods that are reproducing scoring systems.
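
The common dependence on the interitem correlations can be made concrete with a simplified sketch. It assumes an idealized test in which every pair of items correlates equally (an assumption made here for illustration only), in which case both the standardized alpha coefficient and the first eigenvalue of the correlation matrix are increasing functions of that single correlation. The function names are hypothetical.

```python
def standardized_alpha(k, mean_r):
    """Standardized alpha for k items with average interitem correlation mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)

def first_eigenvalue_equicorrelated(k, mean_r):
    """Largest eigenvalue of a k x k correlation matrix whose off-diagonal
    entries all equal mean_r (equicorrelation assumed for illustration)."""
    return 1 + (k - 1) * mean_r

# 30 items, as in this study; the correlations themselves are hypothetical.
for mean_r in (0.15, 0.20, 0.25):
    alpha = standardized_alpha(30, mean_r)
    eig = first_eigenvalue_equicorrelated(30, mean_r)
    print(mean_r, round(alpha, 3), round(eig, 2))
```

Under this idealization a scoring method that raises the average interitem correlation necessarily raises both indices together, which is the pattern described above.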

On the basis of these results, either the PACA or AIKEN scoring methods can
be recommended for use with multiple-choice items with a probabilistic response
format. Since PACA is the simpler of the two methods, it might be the preferable
scoring method.


Conclusions 

Test scores obtained from the five methods of scoring multiple-choice items
with a probabilistic response format do not appear to be affected by the
response style or personality variable of examinee certainty to a greater degree
than scores obtained under traditional multiple-choice instructions. The scoring
method used does not affect the validity of the test scores but does appear to
affect the internal consistency of the scores. Test scores obtained using the
PACA scoring method were more reliable, simpler to compute, and as valid as
those obtained from the other scoring methods; therefore, use of the PACA
scoring method is recommended for these types of items.

As a note of caution, however, one of the three reproducing scoring systems
might have a practical advantage over either the PACA or AIKEN scoring formulas.
In a situation where examinees were aware of the scoring formula to be used and
where the scores were of some importance to the examinee (as for a classroom
grade or selection procedure), the examinees could optimize their test score
using the reproducing scoring systems only by responding according to their
actual beliefs in the correctness of each alternative, while their total scores
could be maximized with the PACA scoring formula by assigning the maximum
probability of 1.00 to the one alternative they thought was the correct one. If
examinees were expected to utilize this strategy, one of the reproducing scoring
systems would be better to use with multiple-choice items with a probabilistic
response format. Test scores obtained from the spherical reproducing scoring
system were more reliable, as valid, and showed a stronger first factor than
scores from the other reproducing scoring systems. Thus, if the practical
situation requires use of a reproducing scoring system, the spherical RSS should
be used.
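
The strategic difference described above can be checked numerically. The sketch below uses commonly cited forms of the quadratic, spherical, and truncated logarithmic scoring rules and treats the PACA score as simply the probability placed on the keyed alternative. These forms, the truncation point of .01, and the belief vector are assumptions made for illustration; the scaling constants used in this report may differ.

```python
from math import log, sqrt

def quadratic(p, c):
    """Quadratic rule: 2 * p[c] minus the sum of squared probabilities."""
    return 2 * p[c] - sum(q * q for q in p)

def spherical(p, c):
    """Spherical rule: p[c] divided by the length of the probability vector."""
    return p[c] / sqrt(sum(q * q for q in p))

def truncated_log(p, c, floor=0.01):
    """Logarithmic rule with probabilities below `floor` treated as `floor`."""
    return log(max(p[c], floor))

def paca(p, c):
    """Assumed PACA score: the probability assigned to the correct alternative."""
    return p[c]

def expected_score(rule, belief, report):
    """Expected score when alternative k is correct with probability belief[k]."""
    return sum(b * rule(report, k) for k, b in enumerate(belief))

belief = [0.6, 0.2, 0.1, 0.1]    # hypothetical true beliefs of an examinee
extreme = [1.0, 0.0, 0.0, 0.0]   # all points placed on the favored alternative

for rule in (quadratic, spherical, truncated_log, paca):
    honest = expected_score(rule, belief, belief)
    gamed = expected_score(rule, belief, extreme)
    print(rule.__name__, round(honest, 3), round(gamed, 3))
```

For this belief vector the three reproducing rules give a higher expected score to the honest report, whereas the PACA score is higher in expectation when all of the points are placed on the favored alternative.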
Appendix:
Supplementary Tables



Table A
IRT Item Parameters for
Multiple-Choice Analogy Items

Item
Number        a         b        c

310         .616     -.483     .20
273         .627     2.062     .20
275         .652     1.617     .21
286         .673     2.407     .09
327         .693     1.129     .22
399         .722      .446     .24
419         .750     2.413     .16
278         .770     2.002     .17
266         .815     1.670     .38
271         .828     1.266     .09
268         .844     1.036     .17
392         .865     -.360     .20
492         .914     -.145     .12
331         .930     1.352     .20
578         .946      .271     .20
405         .983      .739     .16
323        1.005      .828     .20
394        1.006     -.153     .20
277        1.041     1.930     .17
335        1.075     1.525     .20
575        1.098      .197     .25
560        1.132     -.007     .27
452        1.156     -.341     .30
493        1.172      .076     .26
576        1.211      .633     .20
415        1.234     1.183     .24
322        1.232      .960     .17
250        1.288      .513     .17
284        1.357     2.232     .24
339        1.608     1.818     .17

Mean        .975      .961     .20
SD          .244      .887     .06
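
The three parameters in Table A can be read as the discrimination (a), difficulty (b), and lower asymptote (c) of a three-parameter item response function. The sketch below assumes the logistic form with the scaling constant D = 1.7; neither the form nor the constant is stated in this appendix, so both should be treated as assumptions.

```python
from math import exp

def p_correct(theta, a, b, c, D=1.7):
    """Three-parameter logistic probability of a correct response.

    theta: examinee ability; a, b, c: item parameters as in Table A.
    D = 1.7 is an assumed scaling constant.
    """
    return c + (1 - c) / (1 + exp(-D * a * (theta - b)))

# Item 310 from Table A: a = .616, b = -.483, c = .20.
for theta in (-2.0, -0.483, 0.0, 2.0):
    print(theta, round(p_correct(theta, 0.616, -0.483, 0.20), 3))
```

At theta equal to b the function returns c + (1 - c)/2, which is .60 for item 310; for very low theta it approaches the lower asymptote c.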
Table B

Instructions Given Prior to Administration of Multiple-Choice
Items with a Probabilistic Response Format



Screen 29891*

That completes the introductory information.

Type "GO" and press "RETURN" for the instructions for
the first test.

Screen 29842*

This is a test of word knowledge. It is probably different
from other tests you have taken, so it is important to read
the instructions carefully to understand how to answer the
questions.

Each question consists of a pair of words that have a specific
relationship to each other, followed by four possible answers
consisting of pairs of words. One of these four pairs of
words has the same relationship as the first pair of words.

Type "GO" and press "RETURN" for an example.

Screen 29824*

For example:

Hot:Cold

1) Hard:Soft
2) House:Building
3) Mule:Horse
4) Yellow:Brown

Your job in this test is not to choose the correct answer
(the pair of words that has the same relationship as the first
pair of words) but to indicate your confidence that each of
the four answers is the correct answer.

Type "GO" and press "RETURN" to continue the instructions.

Screen 29804*

You indicate your confidence by distributing 100 points
among the four answers. The answer you think is the
correct one should get the highest number of points, and
the answer you feel is least likely to be the correct answer
should get the lowest number of points.

The more certain you are that an answer is the correct one,
the closer your response to that answer should be to 100.
The more certain you are that an answer is NOT the correct
one, the closer your response for that answer should be to 0.






If you are completely certain that one of the answers is the
correct answer, assign 100 to that answer and 0 to the other
answers for that question. If you are completely uncertain as
to which answer is correct, assign 25 to each of the four
answers.

Type "GO" and press "RETURN" to continue.

Screen 29805*

The numbers you distribute among the four answers must sum to
99 or 100. However, you can distribute the 100 points in any
way you like, as long as they reflect your certainty as to the
"correctness" of each answer.

To answer a question, type the numbers you assign to each
answer in a line in the order in which the answers appear in
the question. Separate each number by a comma.

Type "GO" and press "RETURN" for an example.

Screen 29825*

Going back to the example question:

Hot:Cold

1) Hard:Soft
2) House:Building
3) Mule:Horse
4) Yellow:Brown

Suppose a person responded with the following numbers:
? 80,0,0,20

This person was:

a) fairly sure, but not completely certain, that
   the first answer (Hard:Soft) had the same
   relationship as the pair of words in the
   question and thus was the correct answer.

b) completely certain that answers "2" and "3"
   were NOT the correct choice.

c) unsure about whether or not the fourth answer
   was the correct answer, but felt that it was
   closer to being an incorrect answer than the
   correct answer.

Note that 80 + 0 + 0 + 20 = 100.

Type "GO" and press "RETURN" to continue the instructions.




Screen 29826*

Let's look at this question once more:

Hot:Cold

1) Hard:Soft
2) House:Building
3) Mule:Horse
4) Yellow:Brown

Suppose a person responded with the following numbers:
? 33,0,33,33

This person was:

a) completely certain that the second answer was NOT the
   correct answer.

b) unsure as to which of the remaining answers was correct
   and felt that any of the remaining three answers were
   equally likely to be the correct answer.

Type "GO" and press "RETURN" to continue the instructions.

Screen 29827*

As you can see, there is an almost endless variety of
combinations of numbers that you may use to state your
confidence in the four possible answers. Use the entire
range of numbers between 0 and 100 to express your
confidence. Remember also that the numbers you assign to
the four answers must sum to 99 or 100.

Please ask the proctor for help if you have any questions.

Type "GO" and press "RETURN" when you are ready to start
the test.

*This line is for identification only and was not displayed.
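
The response format described in Table B (four comma-separated numbers summing to 99 or 100) lends itself to a simple validity check before scoring. The following is a minimal sketch of such a check; the function name is hypothetical, and the conversion of points to proportions by dividing by the total entered is an assumption, since the instructions do not say how a sum of 99 was handled.

```python
def parse_response(line, n_alternatives=4):
    """Parse a response such as '80,0,0,20' into a list of proportions.

    Raises ValueError if the entry does not follow the instructions.
    Dividing by the total entered (99 or 100) is an assumed convention.
    """
    parts = [part.strip() for part in line.split(",")]
    if len(parts) != n_alternatives:
        raise ValueError("expected one number per answer")
    points = [int(part) for part in parts]   # int() rejects non-numeric entries
    if any(x < 0 or x > 100 for x in points):
        raise ValueError("each number must lie between 0 and 100")
    total = sum(points)
    if total not in (99, 100):
        raise ValueError("numbers must sum to 99 or 100")
    return [x / total for x in points]

print(parse_response("80,0,0,20"))
print(parse_response("33,0,33,33"))
```

Allowing a total of 99 accommodates an even split among three answers, as in the second example shown on the instruction screens.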
Table C

Eigenvalues for the First Fifteen Principal Factors
of Real and Random Data for Each Scoring Method

              QUAD            SPHER            TLOG             PACA
Factor    Real  Random    Real  Random    Real  Random    Real  Random

1         6.38   1.01     6.67   1.00     5.65   1.02     8.16